strand forming an initial DNA–RNA hybrid from which the new mRNA transcript is
separated. The purpose of this exercise is to investigate the promoter regions of the genes
targeted by the DNA-directed RNA polymerase II subunit RPB1 during gene transcription.
The raw data consist of four single-end FASTQ files generated on an Illumina Genome
Analyzer and available from the ENCODE database: three ChIP files with the accession
numbers ENCFF000XJP, ENCFF000XJS, and ENCFF000XKD, and one input (control) file,
ENCFF000XGP. The accession number of the input (control) experiment is ENCSR000EZM.
To keep the files organized, we can create a project directory called “chipseq” with a
subdirectory called “data”, and download the FASTQ files into it as follows:
mkdir chipseq; cd chipseq; mkdir data
wget \
-O "data/ENCFF000XJP_chp1.fastq.gz" \
"https://www.encodeproject.org/files/ENCFF000XJP/@@download/ENCFF000XJP.fastq.gz"
wget \
-O "data/ENCFF000XJS_chp2.fastq.gz" \
"https://www.encodeproject.org/files/ENCFF000XJS/@@download/ENCFF000XJS.fastq.gz"
wget \
-O "data/ENCFF000XKD_chp3.fastq.gz" \
"https://www.encodeproject.org/files/ENCFF000XKD/@@download/ENCFF000XKD.fastq.gz"
wget \
-O "data/ENCFF000XGP_inp0.fastq.gz" \
"https://www.encodeproject.org/files/ENCFF000XGP/@@download/ENCFF000XGP.fastq.gz"
The four files will be saved in the “data” directory as ENCFF000XJP_chp1.fastq.gz,
ENCFF000XJS_chp2.fastq.gz, ENCFF000XKD_chp3.fastq.gz, and ENCFF000XGP_inp0.fastq.gz.
The last of these is the FASTQ file that contains the input (control) data.
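Large downloads can be truncated, so it may be worth confirming that each archive is intact before moving on. The following is a minimal sketch, run from the “chipseq” directory. It assumes standard four-line FASTQ records, and the helper name count_reads is hypothetical, introduced here only for illustration:

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file.
# Assumes each record occupies exactly four lines.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Test the integrity of every downloaded archive and report its read count.
for f in data/*.fastq.gz; do
    [ -e "$f" ] || continue    # skip if the glob matched nothing
    if gzip -t "$f" 2>/dev/null; then
        printf '%s\t%s reads\n' "$f" "$(count_reads "$f")"
    else
        printf '%s\tcorrupt or incomplete download\n' "$f" >&2
    fi
done
```

A file reported as corrupt or incomplete can simply be downloaded again with the corresponding wget command.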
6.3.2 Quality Control
Quality control is an important step in all sequencing data analysis workflows. The
quality of the reads in a FASTQ file can be assessed with an appropriate program such as
FastQC, which reports the per-base read quality, technical sequences such as adapter
dimers, PCR duplicate reads, GC-content bias, and other sequencing biases. We should fix
as many potential problems as possible before proceeding to the mapping step. Refer to
Chapter 1 for the quality assessment metrics and the approaches to fix potential faults.
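As a rough illustration of one of these metrics, the overall GC content of a FASTQ file can be estimated directly from the sequence lines. This is only a sketch and not a substitute for the FastQC report. It assumes four-line FASTQ records and is run from the “chipseq” directory, and the helper name gc_content is hypothetical:

```shell
# Hypothetical helper: overall GC percentage of a gzipped FASTQ file.
# Assumes four-line records, so the sequence is line 2 of every record.
gc_content() {
    gzip -cd "$1" | awk 'NR % 4 == 2 {
        n  += length($0)             # total number of bases
        gc += gsub(/[GCgc]/, "")     # gsub returns the number of G/C bases
    }
    END { if (n > 0) printf "%.2f\n", 100 * gc / n }'
}

# Report the GC percentage of each downloaded file.
for f in data/*.fastq.gz; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    printf '%s\t%s%% GC\n' "$f" "$(gc_content "$f")"
done
```

A large difference in GC content between the ChIP files and the input file may be worth a closer look in the FastQC per-sequence GC content plot.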
cd data
fastqc \
ENCFF000XJP_chp1.fastq.gz \
ENCFF000XJS_chp2.fastq.gz \